General Notes

  • Any function that reads text should accept a pre-existing vocabulary that specifies the token-to-id mappings; if no vocabulary is provided, it should create one from scratch. What should happen if a vocabulary is provided but does not contain entries for all tokens in the input? Should there be extensible and non-extensible vocabulary types, or a separate option to control this behaviour? If the vocabulary is not extended, the parser can either fail, or succeed and discard the unmapped tokens. Again, how is this controlled or decided?

  • Stop word processing also changes vocabularies, so we need to be able to filter a corpus to reduce it to a narrower vocabulary. This could be used to defer the decision on how to handle a pre-provided vocabulary when reading: the resulting corpus has a vocabulary that may be larger than the one provided, but all words common to both are encoded with the same index. The resulting corpus can then be filtered back down to the original vocabulary if desired.
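
A minimal Python sketch of the two options discussed above: the unknown-token behaviour as an explicit option on the reader, and a separate filter step that narrows a corpus back to a given vocabulary. All names here (`encode`, `filter_to_vocab`, the `on_unknown` values) are hypothetical and only illustrate the idea; they are not an existing API.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Vocab = Dict[str, int]  # token -> id


def encode(
    docs: Iterable[List[str]],
    vocab: Optional[Vocab] = None,
    on_unknown: str = "extend",  # hypothetical option: "extend", "discard" or "fail"
) -> Tuple[List[List[int]], Vocab]:
    """Map tokens to ids, starting from `vocab` if one is given."""
    if on_unknown not in ("extend", "discard", "fail"):
        raise ValueError(f"unknown policy: {on_unknown!r}")
    # Work on a copy so the caller's vocabulary is left untouched.
    vocab = dict(vocab) if vocab is not None else {}
    next_id = max(vocab.values(), default=-1) + 1
    encoded = []
    for doc in docs:
        ids = []
        for token in doc:
            if token not in vocab:
                if on_unknown == "fail":
                    raise KeyError(f"token not in vocabulary: {token!r}")
                if on_unknown == "discard":
                    continue
                # "extend": new tokens get fresh ids; existing ids never change.
                vocab[token] = next_id
                next_id += 1
            ids.append(vocab[token])
        encoded.append(ids)
    return encoded, vocab


def filter_to_vocab(
    encoded: List[List[int]], vocab: Vocab, keep: Iterable[str]
) -> Tuple[List[List[int]], Vocab]:
    """Drop every id whose token is not in `keep`; kept ids are unchanged."""
    keep = set(keep)
    narrowed = {tok: i for tok, i in vocab.items() if tok in keep}
    keep_ids = set(narrowed.values())
    filtered = [[i for i in doc if i in keep_ids] for doc in encoded]
    return filtered, narrowed
```

With this shape, reading with "extend" and then calling `filter_to_vocab` with the original vocabulary gives the same ids as reading with "discard", so the decision can be made after the corpus has been read.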

GibbsLDA++ format, from http://gibbslda.sourceforge.net/#3.2_Input_Data_Format

Both data for training/estimating the model and new data (i.e., previously unseen data) have the same format as follows:

[M]
[document1]
[document2]
...
[documentM]

in which the first line is the total number of documents [M]. Each line after that is one document. [documenti] is the ith document of the dataset and consists of a list of Ni words/terms.

[documenti] = [wordi1] [wordi2] ... [wordiNi]

in which all [wordij] (i=1..M, j=1..Ni) are text strings and they are separated by the blank character.
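
A reader for this format could look like the following Python sketch. The function name `read_gibbslda` is hypothetical, and two choices are assumptions rather than part of the format description: blank lines are skipped, and tokens are split on any whitespace rather than only the blank character.

```python
from typing import Iterable, List


def read_gibbslda(lines: Iterable[str]) -> List[List[str]]:
    """Parse a count line followed by one whitespace-separated document per line."""
    it = iter(lines)
    m = int(next(it).strip())  # first line: number of documents [M]
    # Assumption: blank lines are not documents and are skipped.
    docs = [line.split() for line in it if line.strip()]
    if len(docs) != m:
        raise ValueError(f"header says {m} documents, found {len(docs)}")
    return docs
```

Since an open text file iterates over its lines, the function can be called as `read_gibbslda(open(path))`, where `path` names a file in this format.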

Blei's LDA-C format, from http://www.cs.princeton.edu/~blei/lda-c

http://www.cs.princeton.edu/~blei/lda-c/readme.txt

Under LDA, the words of each document are assumed exchangeable. Thus, each document is succinctly represented as a sparse vector of word counts. The data is a file where each line is of the form:

 [M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string.
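
A Python sketch of a reader for this format, returning one term-id-to-count mapping per document. The name `read_ldac` is hypothetical, and skipping blank lines is an assumption, not something the readme states.

```python
from typing import Dict, Iterable, List


def read_ldac(lines: Iterable[str]) -> List[Dict[int, int]]:
    """Parse lines of the form '[M] [term]:[count] ...' into sparse count vectors."""
    docs = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        n_unique = int(fields[0])  # [M]: number of unique terms in this document
        counts = {}
        for pair in fields[1:]:
            term, count = pair.split(":")
            counts[int(term)] = int(count)  # terms are integer indices, not strings
        if len(counts) != n_unique:
            raise ValueError(
                f"line {lineno}: expected {n_unique} unique terms, found {len(counts)}"
            )
        docs.append(counts)
    return docs
```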

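The UCI Bag of Words format uses two files. The docword.*.txt file has 3 header lines (D, the number of documents; W, the number of words in the vocabulary; NNZ, the number of nonzero counts), followed by NNZ triples: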

D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count

In the vocab.*.txt file, line n contains the word with wordID=n.
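
The two files can be read with a short Python sketch like the one below. The function names are hypothetical, and the sketch assumes that line numbers in the vocab file, and therefore wordIDs, start at 1.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def read_docword(lines: Iterable[str]) -> Tuple[int, int, Dict[int, Dict[int, int]]]:
    """Parse the three header lines and the triples; returns (D, W, counts)."""
    it = iter(lines)
    d = int(next(it))    # header line 1
    w = int(next(it))    # header line 2
    nnz = int(next(it))  # header line 3: number of triples that follow
    counts: Dict[int, Dict[int, int]] = defaultdict(dict)
    seen = 0
    for line in it:
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        doc_id, word_id, count = map(int, fields)
        counts[doc_id][word_id] = count
        seen += 1
    if seen != nnz:
        raise ValueError(f"header says {nnz} triples, found {seen}")
    return d, w, dict(counts)


def read_vocab(lines: Iterable[str]) -> Dict[int, str]:
    """Map wordID -> word, assuming line n (counting from 1) holds wordID n."""
    return {n: line.strip() for n, line in enumerate(lines, start=1)}
```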